D² Tweedie Score (d2_tweedie_score)#

d2_tweedie_score is a relative-to-baseline regression score:

“What fraction of the Tweedie deviance did my model explain compared to just predicting the mean?”

It generalizes \(R^2\) to non-Gaussian targets (counts, positively skewed data, and zero-inflated data with a point mass at 0 plus a positive continuous part) via the Tweedie family.


Learning goals#

By the end you should be able to:

  • define \(D^2\) as “fraction of Tweedie deviance explained”

  • understand the power parameter \(p\) and the allowed target/prediction domains

  • implement mean_tweedie_deviance and d2_tweedie_score from scratch in NumPy (incl. sample weights)

  • visualize how different \(p\) values change the penalty for under/over-prediction

  • optimize a simple Poisson regression (a Tweedie GLM with \(p=1\)) and track \(D^2\) during training

Quick import#

from sklearn.metrics import d2_tweedie_score

Prerequisites#

  • logs and exponentials

  • weighted averages

  • basic regression notation (\(y\), \(\hat{y}\))

  • (optional) GLMs / deviance

import numpy as np
import plotly.express as px
import plotly.graph_objects as go
import os
import plotly.io as pio
from plotly.subplots import make_subplots

from sklearn.metrics import d2_tweedie_score as sk_d2_tweedie_score
from sklearn.metrics import mean_tweedie_deviance as sk_mean_tweedie_deviance

pio.templates.default = "plotly_white"
pio.renderers.default = os.environ.get("PLOTLY_RENDERER", "notebook")
np.set_printoptions(precision=4, suppress=True)

rng = np.random.default_rng(42)

1) Definition: “fraction of Tweedie deviance explained”#

Given targets \(y_1,\dots,y_n\) and predictions \(\hat{\mu}_1,\dots,\hat{\mu}_n\), define the (possibly weighted) mean Tweedie deviance (the per-sample unit deviance \(d_p\) is defined in section 2):

\[ \bar{d}_p(y,\hat{\mu}) = \frac{1}{\sum_i w_i}\sum_{i=1}^n w_i\, d_p(y_i, \hat{\mu}_i), \qquad w_i \ge 0 \]

The \(D^2\) Tweedie score compares your model to the constant baseline that predicts the (weighted) mean:

\[ \bar{y}_w = \frac{\sum_i w_i y_i}{\sum_i w_i}, \qquad \hat{\mu}^{(0)}_i = \bar{y}_w \]
\[ D^2_p = 1 - \frac{\bar{d}_p(y,\hat{\mu})}{\bar{d}_p(y,\hat{\mu}^{(0)})} \]

Interpretation (just like \(R^2\)):

  • \(D^2_p = 1\): perfect predictions (zero deviance)

  • \(D^2_p = 0\): no improvement over predicting the mean

  • \(D^2_p < 0\): worse than the mean baseline

Special case: \(p=0\) uses squared-error deviance, so \(D^2_0\) equals \(R^2\).

# Small example: same y_true, different predictors, different powers

y_true = np.array([0.5, 1.0, 2.5, 7.0])

preds = {
    "Mean baseline": np.full_like(y_true, y_true.mean()),
    "Example model": np.array([1.0, 1.0, 5.0, 3.5]),
    "Perfect": y_true.copy(),
    "Bad constant": np.full_like(y_true, 10.0),
}

powers = [0, 1, 2]

fig = go.Figure()
for p in powers:
    scores = [sk_d2_tweedie_score(y_true, y_pred, power=p) for y_pred in preds.values()]
    fig.add_trace(go.Bar(name=f"power={p}", x=list(preds.keys()), y=scores))

fig.update_layout(
    title="D² Tweedie score across different powers (same data, same predictions)",
    barmode="group",
    xaxis_title="Predictor",
    yaxis_title="D² (higher is better)",
    width=950,
    height=420,
)
fig.show()

2) Tweedie family + Tweedie deviance (the loss behind \(D^2\))#

The Tweedie family is an exponential-dispersion model where the variance scales with the mean:

\[ \mathrm{Var}(Y) = \phi\,\mu^p \]

Common cases:

| \(p\) | Distribution | Typical target domain |
| --- | --- | --- |
| \(0\) | Normal | \(y\in\mathbb{R}\) |
| \(1\) | Poisson | \(y\ge 0\) (counts) |
| \(1<p<2\) | Compound Poisson–Gamma | \(y\ge 0\) (mass at 0 + continuous \(>0\)) |
| \(2\) | Gamma | \(y>0\) |
| \(3\) | Inverse Gaussian | \(y>0\) |
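The variance relation \(\mathrm{Var}(Y)=\phi\,\mu^p\) is easy to check empirically; a minimal sketch with simulated Poisson (\(p=1\), \(\phi=1\)) and Gamma (\(p=2\), \(\phi=1/k\)) samples (sample size and parameters chosen arbitrarily):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Poisson: Var(Y) = mu, i.e. p=1 and phi=1
mu = 4.0
y_pois = rng.poisson(mu, size=n)
print(y_pois.mean(), y_pois.var())  # both ≈ 4

# Gamma with shape k, scale s: mean = k*s, Var = k*s^2 = (1/k) * mean^2, i.e. p=2, phi=1/k
k, s = 2.0, 3.0
y_gam = rng.gamma(shape=k, scale=s, size=n)
print(y_gam.var(), (1 / k) * y_gam.mean() ** 2)  # both ≈ 18
```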

In a GLM, we model a mean \(\mu_i = \mathbb{E}[y_i\mid x_i]\). To keep \(\mu_i>0\) we often use a log link:

\[ \eta_i = x_i^\top \beta,\qquad \mu_i = \exp(\eta_i) \]

Unit deviance \(d_p(y,\mu)\)#

mean_tweedie_deviance is the average of a per-sample unit deviance \(d_p(y,\mu)\) (with the convention \(0\log 0 = 0\)).

Special cases:

\[ d_0(y,\mu) = (y-\mu)^2 \]
\[ d_1(y,\mu) = 2\left(y\log\frac{y}{\mu} - y + \mu\right) \]
\[ d_2(y,\mu) = 2\left(\log\frac{\mu}{y} + \frac{y}{\mu} - 1\right) \]

General case (\(p \notin \{0,1,2\}\)):

\[ d_p(y,\mu)= 2\left[\frac{y^{2-p}}{(1-p)(2-p)} - \frac{y\mu^{1-p}}{1-p} + \frac{\mu^{2-p}}{2-p}\right] \]
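The special cases above are limits of this general formula; a quick numerical check near \(p=1\) (toy values, helper names chosen here just for illustration):

```python
import numpy as np

def dev_general(y, mu, p):
    # General Tweedie unit deviance, valid for p not in {0, 1, 2}
    return 2 * (
        y ** (2 - p) / ((1 - p) * (2 - p))
        - y * mu ** (1 - p) / (1 - p)
        + mu ** (2 - p) / (2 - p)
    )

def dev_poisson(y, mu):
    # d_1(y, mu) = 2 (y log(y/mu) - y + mu)
    return 2 * (y * np.log(y / mu) - y + mu)

y, mu = 3.0, 2.0
# As p -> 1, the general formula approaches the Poisson deviance
print(dev_general(y, mu, 1 + 1e-7), dev_poisson(y, mu))
```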

Domain requirements (matching scikit-learn):

  • \(p < 0\): \(y\in\mathbb{R}\), \(\mu>0\) (the formula uses \(\max(y,0)^{2-p}\) to avoid non-real powers)

  • \(p = 0\): \(y,\mu\in\mathbb{R}\)

  • \(1 \le p < 2\): \(y\ge 0\), \(\mu>0\)

  • \(p \ge 2\): \(y>0\), \(\mu>0\)

Derivative (useful for optimization)#

For all allowed \(p\) (the special cases \(p=0,1,2\) are recovered as limits):

\[ \frac{\partial d_p(y,\mu)}{\partial \mu} = 2\left(\mu^{1-p} - y\mu^{-p}\right) = 2\mu^{-p}(\mu - y) \]

So \(d_p\) is minimized at \(\mu=y\), but the curvature / asymmetry depends on \(p\).
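A finite-difference sanity check of this derivative (arbitrary \(y\), \(\mu\), and \(p\)):

```python
import numpy as np

def dev(y, mu, p):
    # General Tweedie unit deviance for p not in {0, 1, 2}
    return 2 * (
        y ** (2 - p) / ((1 - p) * (2 - p))
        - y * mu ** (1 - p) / (1 - p)
        + mu ** (2 - p) / (2 - p)
    )

y, mu, p = 3.0, 2.0, 1.5
analytic = 2 * mu ** (-p) * (mu - y)

# Central finite difference in mu
h = 1e-6
numeric = (dev(y, mu + h, p) - dev(y, mu - h, p)) / (2 * h)
print(analytic, numeric)
```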

def _xlogy(x, y):
    """Compute x * log(y) with the convention 0 * log(y) = 0."""

    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    bx, by = np.broadcast_arrays(x, y)

    out = np.zeros_like(bx, dtype=float)
    mask = bx != 0
    out[mask] = bx[mask] * np.log(by[mask])
    return out


def _validate_power(power: float) -> float:
    p = float(power)
    if not (p <= 0 or p >= 1):
        raise ValueError(f"power must satisfy p <= 0 or p >= 1, got p={p}")
    return p


def _check_1d_targets(y_true, y_pred, sample_weight=None):
    y_true = np.asarray(y_true, dtype=float).reshape(-1)
    y_pred = np.asarray(y_pred, dtype=float).reshape(-1)

    if y_true.shape != y_pred.shape:
        raise ValueError(f"y_true and y_pred must have the same shape, got {y_true.shape} vs {y_pred.shape}")

    if sample_weight is None:
        return y_true, y_pred, None

    w = np.asarray(sample_weight, dtype=float).reshape(-1)
    if w.shape[0] != y_true.shape[0]:
        raise ValueError(f"sample_weight must have shape (n_samples,), got {w.shape}")
    if np.any(w < 0):
        raise ValueError("sample_weight must be non-negative")
    if float(np.sum(w)) == 0.0:
        raise ValueError("Sum of sample_weight must be > 0")

    return y_true, y_pred, w


def tweedie_unit_deviance_numpy(y_true, y_pred, *, power: float = 0.0) -> np.ndarray:
    """Per-sample Tweedie unit deviance d_p(y, μ) (NumPy).

    This matches scikit-learn's conventions and input domain checks.
    """

    y_true = np.asarray(y_true, dtype=float).reshape(-1)
    y_pred = np.asarray(y_pred, dtype=float).reshape(-1)
    if y_true.shape != y_pred.shape:
        raise ValueError(f"y_true and y_pred must have the same shape, got {y_true.shape} vs {y_pred.shape}")

    p = _validate_power(power)

    message = f"Mean Tweedie deviance error with power={p} can only be used on "
    if p < 0:
        # 'Extreme stable', y any real number, y_pred > 0
        if np.any(y_pred <= 0):
            raise ValueError(message + "strictly positive y_pred.")
    elif p == 0:
        # Normal, y and y_pred can be any real number
        pass
    elif 1 <= p < 2:
        # Poisson and compound Poisson distribution, y >= 0, y_pred > 0
        if np.any(y_true < 0) or np.any(y_pred <= 0):
            raise ValueError(message + "non-negative y and strictly positive y_pred.")
    elif p >= 2:
        # Gamma / Inverse Gaussian / positive stable, y and y_pred > 0
        if np.any(y_true <= 0) or np.any(y_pred <= 0):
            raise ValueError(message + "strictly positive y and y_pred.")

    if p < 0:
        # Prevent non-real powers for negative targets
        y_pos = np.maximum(y_true, 0.0)
        dev = 2 * (
            np.power(y_pos, 2 - p) / ((1 - p) * (2 - p))
            - y_true * np.power(y_pred, 1 - p) / (1 - p)
            + np.power(y_pred, 2 - p) / (2 - p)
        )
    elif p == 0:
        dev = (y_true - y_pred) ** 2
    elif p == 1:
        dev = 2 * (_xlogy(y_true, y_true / y_pred) - y_true + y_pred)
    elif p == 2:
        dev = 2 * (np.log(y_pred / y_true) + y_true / y_pred - 1)
    else:
        dev = 2 * (
            np.power(y_true, 2 - p) / ((1 - p) * (2 - p))
            - y_true * np.power(y_pred, 1 - p) / (1 - p)
            + np.power(y_pred, 2 - p) / (2 - p)
        )

    return dev


def mean_tweedie_deviance_numpy(y_true, y_pred, *, power: float = 0.0, sample_weight=None) -> float:
    """Mean Tweedie deviance (NumPy)."""

    y_true, y_pred, w = _check_1d_targets(y_true, y_pred, sample_weight)
    dev = tweedie_unit_deviance_numpy(y_true, y_pred, power=power)

    if w is None:
        return float(np.mean(dev))
    return float(np.average(dev, weights=w))


def d2_tweedie_score_numpy(y_true, y_pred, *, power: float = 0.0, sample_weight=None) -> float:
    """NumPy implementation of scikit-learn's d2_tweedie_score.

    Notes
    -----
    - Best score is 1.0; it can be negative.
    - Returns NaN if n_samples < 2.
    - Can also return NaN if the baseline deviance is 0 (e.g., constant targets).
    """

    y_true, y_pred, w = _check_1d_targets(y_true, y_pred, sample_weight)
    if y_true.shape[0] < 2:
        return float("nan")

    numerator = mean_tweedie_deviance_numpy(y_true, y_pred, power=power, sample_weight=w)

    y_avg = float(np.average(y_true, weights=w)) if w is not None else float(np.mean(y_true))
    denominator = mean_tweedie_deviance_numpy(
        y_true,
        np.full_like(y_true, y_avg),
        power=power,
        sample_weight=w,
    )

    with np.errstate(divide="ignore", invalid="ignore"):
        return float(1.0 - numerator / denominator)

# Quick checks vs scikit-learn (should match exactly)

for p in [0, 1, 1.5, 2, 3, -0.5]:
    if p == 0:
        y_true = rng.normal(size=200)
        y_pred = y_true + rng.normal(scale=0.3, size=200)
    elif p < 0:
        y_true = rng.normal(size=200)
        y_pred = np.exp(rng.normal(size=200))
    elif 1 <= p < 2:
        y_true = rng.poisson(3.0, size=200).astype(float)
        y_pred = np.exp(rng.normal(size=200))
    else:
        y_true = rng.gamma(shape=2.0, scale=2.0, size=200)
        y_pred = np.exp(rng.normal(size=200))

    w = rng.uniform(0.2, 2.0, size=200)

    sk_loss = sk_mean_tweedie_deviance(y_true, y_pred, power=p, sample_weight=w)
    np_loss = mean_tweedie_deviance_numpy(y_true, y_pred, power=p, sample_weight=w)

    sk_d2 = sk_d2_tweedie_score(y_true, y_pred, power=p, sample_weight=w)
    np_d2 = d2_tweedie_score_numpy(y_true, y_pred, power=p, sample_weight=w)

    print(f"power={p:>4}: mean dev diff={abs(sk_loss-np_loss):.3e} | D² diff={abs(sk_d2-np_d2):.3e}")
power=   0: mean dev diff=0.000e+00 | D² diff=0.000e+00
power=   1: mean dev diff=0.000e+00 | D² diff=0.000e+00
power= 1.5: mean dev diff=0.000e+00 | D² diff=0.000e+00
power=   2: mean dev diff=0.000e+00 | D² diff=0.000e+00
power=   3: mean dev diff=0.000e+00 | D² diff=0.000e+00
power=-0.5: mean dev diff=0.000e+00 | D² diff=0.000e+00

3) Intuition: how \(p\) changes the penalty#

Two important intuitions:

  1. Asymmetry for \(p\ge 1\): underpredicting towards \(\mu\to 0^+\) can be extremely costly when \(y>0\).

  2. Mean-dependent scaling: the derivative

\[ \frac{\partial d_p(y,\mu)}{\partial \mu} = 2\mu^{-p}(\mu-y) \]

shows that (for \(p>0\)) errors at larger \(\mu\) are downweighted by \(\mu^{-p}\). This is the idea behind using Tweedie losses when variance grows with the mean.
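Both effects are visible on concrete numbers: under- and over-predicting \(y=5\) by the same absolute amount costs the same at \(p=0\) but not at \(p\ge 1\) (using length-1 arrays with scikit-learn's mean_tweedie_deviance):

```python
from sklearn.metrics import mean_tweedie_deviance

y = [5.0]
under, over = [2.5], [7.5]  # same absolute error of 2.5

for p in [0, 1, 2]:
    d_under = mean_tweedie_deviance(y, under, power=p)
    d_over = mean_tweedie_deviance(y, over, power=p)
    print(f"p={p}: under={d_under:.4f}, over={d_over:.4f}")
```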

Below are deviance curves (lower is better). Notice how the shape changes with \(p\).

# Unit deviance curves for different powers

mu_pos = np.linspace(1e-3, 15, 500)
y_fixed = 5.0
powers_main = [0, 1, 1.5, 2, 3]

mu_zero = np.linspace(1e-6, 10, 500)
y_zero = 0.0
powers_zero = [0, 1, 1.5]

fig = make_subplots(
    rows=1,
    cols=2,
    subplot_titles=("y = 5 (compare powers)", "y = 0 (compare how predicting >0 is penalized)"),
)

for p in powers_main:
    dev = tweedie_unit_deviance_numpy(np.full_like(mu_pos, y_fixed), mu_pos, power=p)
    fig.add_trace(go.Scatter(x=mu_pos, y=dev, mode="lines", name=f"p={p}"), row=1, col=1)

for p in powers_zero:
    dev = tweedie_unit_deviance_numpy(np.full_like(mu_zero, y_zero), mu_zero, power=p)
    fig.add_trace(go.Scatter(x=mu_zero, y=dev, mode="lines", name=f"p={p}"), row=1, col=2)

fig.add_vline(x=y_fixed, line_dash="dash", line_color="gray", row=1, col=1)

fig.update_xaxes(title_text="Predicted mean μ", row=1, col=1)
fig.update_yaxes(title_text="Unit deviance d_p(y,μ)", row=1, col=1)
fig.update_xaxes(title_text="Predicted mean μ", row=1, col=2)
fig.update_yaxes(title_text="Unit deviance d_p(y,μ)", row=1, col=2)

fig.update_layout(
    title="Tweedie unit deviance curves",
    width=1100,
    height=450,
    legend_title_text="power",
)
fig.show()

# D² as a function of multiplicative mis-calibration of the predicted mean

n = 600
x = rng.normal(size=n)
X = np.c_[np.ones(n), x]

beta_true = np.array([0.2, 0.7])
mu_true = np.exp(X @ beta_true)
y = rng.poisson(mu_true)

scales = np.linspace(0.2, 3.0, 90)
powers = [0, 1, 1.5]

fig = go.Figure()
for p in powers:
    d2_vals = [d2_tweedie_score_numpy(y, mu_true * s, power=p) for s in scales]
    fig.add_trace(go.Scatter(x=scales, y=d2_vals, mode="lines", name=f"power={p}"))

fig.add_hline(y=0, line_dash="dash", line_color="gray")

fig.update_layout(
    title="D² vs multiplicative mis-calibration of μ (same y_true, same mu_true)",
    xaxis_title="Scale factor s in μ_pred = s · μ_true",
    yaxis_title="D² (higher is better)",
    width=950,
    height=450,
)
fig.show()

4) Using \(D^2\) while optimizing a model (from scratch)#

In

\[ D^2_p = 1 - \frac{\bar{d}_p(y,\hat{\mu})}{\bar{d}_p(y,\hat{\mu}^{(0)})}, \]

the denominator depends only on \(y\) (and weights), so during training it is a constant.

Maximizing \(D^2_p\) is therefore equivalent to minimizing the mean Tweedie deviance \(\bar{d}_p(y,\hat{\mu})\).

We’ll fit a simple Poisson regression (Tweedie with \(p=1\)) using a log link:

\[ \eta = X\beta,\qquad \mu = \exp(\eta) \]

Using the derivative from above and the chain rule, for general \(p\):

\[ \frac{\partial}{\partial \eta_i} d_p(y_i,\mu_i) = \frac{\partial d_p}{\partial \mu_i}\frac{\partial \mu_i}{\partial \eta_i} = 2\left(\mu_i^{2-p} - y_i\mu_i^{1-p}\right) \]

For Poisson (\(p=1\)): \(\frac{\partial}{\partial \eta_i} d_1 = 2(\mu_i - y_i)\).
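A finite-difference check of this chain-rule gradient at \(p=1\) (arbitrary \(y\) and \(\eta\)):

```python
import numpy as np

def d1(y, mu):
    # Poisson unit deviance d_1(y, mu) = 2 (y log(y/mu) - y + mu)
    return 2 * (y * np.log(y / mu) - y + mu)

y, eta = 3.0, 0.5
mu = np.exp(eta)
analytic = 2 * (mu - y)  # d/d(eta) of d_1(y, exp(eta))

# Central finite difference in eta
h = 1e-6
numeric = (d1(y, np.exp(eta + h)) - d1(y, np.exp(eta - h))) / (2 * h)
print(analytic, numeric)
```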

# Fit a Poisson GLM with gradient descent and track both deviance and D²

n = 400
x = rng.uniform(-2.0, 2.0, size=n)
X = np.c_[np.ones(n), x]

beta_true = np.array([0.3, 0.8])
mu_true = np.exp(X @ beta_true)
y = rng.poisson(mu_true)


def fit_tweedie_glm_gd(X, y, *, power=1.0, lr=0.05, n_iter=3000, record_every=10):
    """Gradient descent for a Tweedie GLM with log link: μ = exp(Xβ)."""

    p = float(power)
    n_samples, n_features = X.shape
    beta = np.zeros(n_features, dtype=float)

    hist = {"iter": [], "loss": [], "d2": []}
    for t in range(n_iter + 1):
        eta = X @ beta
        mu = np.exp(np.clip(eta, -20, 20))

        if t % record_every == 0:
            hist["iter"].append(t)
            hist["loss"].append(mean_tweedie_deviance_numpy(y, mu, power=p))
            hist["d2"].append(d2_tweedie_score_numpy(y, mu, power=p))

        # ∂/∂η d_p(y, μ) = 2(μ^{2-p} - y μ^{1-p}) for μ = exp(η)
        g_eta = 2.0 * (mu ** (2.0 - p) - y * (mu ** (1.0 - p)))
        grad = (X.T @ g_eta) / n_samples
        beta -= lr * grad

    return beta, hist


beta_hat, hist = fit_tweedie_glm_gd(X, y, power=1.0, lr=0.05, n_iter=3000, record_every=10)
beta_hat

fig = make_subplots(rows=1, cols=2, subplot_titles=("Mean Poisson deviance", "D² (power=1)"))
fig.add_trace(go.Scatter(x=hist["iter"], y=hist["loss"], mode="lines", name="loss"), row=1, col=1)
fig.add_trace(go.Scatter(x=hist["iter"], y=hist["d2"], mode="lines", name="D²"), row=1, col=2)
fig.add_hline(y=0, line_dash="dash", line_color="gray", row=1, col=2)

fig.update_xaxes(title_text="Iteration", row=1, col=1)
fig.update_yaxes(title_text="Mean deviance", row=1, col=1)
fig.update_xaxes(title_text="Iteration", row=1, col=2)
fig.update_yaxes(title_text="D²", row=1, col=2)

fig.update_layout(width=1050, height=420, title="Training diagnostics")
fig.show()

# Visualize the fitted mean function μ(x)

mu_hat = np.exp(np.clip(X @ beta_hat, -20, 20))

order = np.argsort(x)
x_s = x[order]
y_s = y[order]
mu_true_s = mu_true[order]
mu_hat_s = mu_hat[order]

mu_baseline = np.full_like(mu_hat_s, y.mean())

fig = go.Figure()
fig.add_trace(go.Scatter(x=x_s, y=y_s, mode="markers", name="y (counts)", opacity=0.35))
fig.add_trace(go.Scatter(x=x_s, y=mu_true_s, mode="lines", name="true μ(x)", line=dict(width=3)))
fig.add_trace(go.Scatter(x=x_s, y=mu_hat_s, mode="lines", name="fitted μ(x)", line=dict(width=3)))
fig.add_trace(go.Scatter(x=x_s, y=mu_baseline, mode="lines", name="mean baseline", line=dict(dash="dash")))

fig.update_layout(
    title="Poisson regression fit (log-link): fitted mean vs true mean",
    xaxis_title="feature x",
    yaxis_title="count / mean",
    width=950,
    height=450,
)
fig.show()

5) Practical usage in scikit-learn#

The score itself is just:

from sklearn.metrics import d2_tweedie_score
d2 = d2_tweedie_score(y_true, y_pred, power=p)

For modeling, scikit-learn provides GLM-style estimators such as PoissonRegressor, GammaRegressor, and the general TweedieRegressor.

Below is a minimal Poisson example (power=1).

from sklearn.linear_model import PoissonRegressor
from sklearn.model_selection import train_test_split

X_feat = x.reshape(-1, 1)
X_tr, X_te, y_tr, y_te = train_test_split(X_feat, y, test_size=0.3, random_state=0)

model = PoissonRegressor(alpha=0.0, max_iter=1000)
model.fit(X_tr, y_tr)
mu_te = model.predict(X_te)

print("sklearn PoissonRegressor D² (power=1):", sk_d2_tweedie_score(y_te, mu_te, power=1))

# Compare to our gradient descent fit on the same split
X_tr_i = np.c_[np.ones(X_tr.shape[0]), X_tr[:, 0]]
X_te_i = np.c_[np.ones(X_te.shape[0]), X_te[:, 0]]
beta_hat_tr, _ = fit_tweedie_glm_gd(X_tr_i, y_tr, power=1.0, lr=0.05, n_iter=3000, record_every=50)
mu_te_gd = np.exp(np.clip(X_te_i @ beta_hat_tr, -20, 20))

print("NumPy GD D² (power=1):", d2_tweedie_score_numpy(y_te, mu_te_gd, power=1))
sklearn PoissonRegressor D² (power=1): 0.5782287829762082
NumPy GD D² (power=1): 0.5782287833631345

6) Pros, cons, and when to use#

Pros

  • \(D^2\) is interpretable like \(R^2\): 1 is best, 0 is “mean baseline”, negative means worse than baseline.

  • It is distribution-aware via deviance, making it a natural score for GLMs.

  • Works well for heteroscedastic targets where variance grows with the mean (counts, positive skew, etc.).

  • Supports sample weights.

Cons / pitfalls

  • You must choose a power \(p\); the value (and model ranking) can change with \(p\).

  • Domain restrictions: for most \(p\ge 1\) you need \(y\ge 0\) and strictly positive predictions \(\hat{\mu}>0\).

  • Not defined for \(n<2\) samples; can be NaN when the baseline deviance is 0 (e.g., constant targets).

  • It is a relative score, not an absolute error, and is not symmetric in its arguments.

Good use cases

  • \(p=1\) (Poisson): counts / event rates.

  • \(1<p<2\) (compound Poisson–Gamma): data with many zeros plus positive continuous outcomes (common in insurance claims).

  • \(p=2\) (Gamma) or \(p=3\) (Inverse Gaussian): strictly positive, right-skewed continuous targets.

7) Exercises#

  1. For a fixed \(y>0\), plot \(d_p(y,\mu)\) for several powers and compare the under- vs over-prediction penalty.

  2. Construct a dataset with constant targets and see what happens to \(D^2\) (baseline deviance \(=0\)).

  3. Add sample weights and verify that the baseline prediction becomes the weighted mean.

  4. Simulate data from a Poisson model and compare \(D^2\) for powers \(p=0\) vs \(p=1\).

References#

  • scikit-learn API: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.d2_tweedie_score.html

  • scikit-learn API: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.mean_tweedie_deviance.html

  • Hastie, Tibshirani, Wainwright (2015), Statistical Learning with Sparsity, Eq. (3.11)